A Pilot Study
University of Kentucky
2024-10-25
Dr. Anton Vinogradov, (recent!) PhD, Computer Science
Database of Etymological Roots Beginning in PIE
In DERBi PIE, having an integrated semantic system could provide automated answers to questions such as:
Are certain sound sequences associated with certain meanings or semantic spheres?
Are certain morphological derivations associated with certain meanings or semantic spheres?
How have meanings changed over time into the various branches and daughter languages?
The second approach is more in line with present-day NLP: identifying semantic relationships through word embeddings.
To do so, we must:
Process a corpus for tokens and then lemmas;
Analyze the environments in which these lemmas occur;
Take this information to construct a semantic “hyperspace”.
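The three steps above can be sketched with plain co-occurrence counts. This is a minimal, self-contained toy (the toy corpus and window size are illustrative assumptions, not the study's actual pipeline, and the lemmatization step is omitted):

```python
from collections import Counter, defaultdict

# Toy corpus standing in for the Wikipedia dumps discussed later.
corpus = [
    "the dog chased the cat",
    "the cat chased the mouse",
    "a dog is a loyal animal",
]

# Step 1: tokenize (a real pipeline would also lemmatize here).
tokenized = [sentence.split() for sentence in corpus]

# Step 2: analyze environments by counting co-occurrences
# within a +/-2-word window around each token.
window = 2
cooc = defaultdict(Counter)
for tokens in tokenized:
    for i, word in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if i != j:
                cooc[word][tokens[j]] += 1

# Step 3: each word's co-occurrence counts act as its coordinates
# in the semantic hyperspace.
vocab = sorted({w for tokens in tokenized for w in tokens})
vectors = {w: [cooc[w][c] for c in vocab] for w in vocab}
print(vectors["dog"])
```

In practice one would use a trained embedding model rather than raw counts, but the geometry is the same: one coordinate axis per context dimension.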
Plots like these can be created with generated vectors!
If you were to plot out every vector generated from one of these models, you would have a hyperspace or a semantic space, with the dimensions of each word vector essentially acting as coordinates.
The closer two words lie to each other in this space, the closer in semantic value they are.
It’s possible to adjust how many dimensions you generate for each vector; in general, more dimensions capture finer semantic distinctions, at the cost of more training data and compute.
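“Closeness” in the hyperspace is conventionally measured with cosine similarity. A minimal sketch (the three-dimensional vectors are made-up illustrations, not real embeddings):

```python
import math

def cosine_similarity(u, v):
    # cos(theta) = (u . v) / (|u| * |v|)
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

cat = [0.9, 0.1, 0.3]
dog = [0.8, 0.2, 0.3]
car = [0.1, 0.9, 0.6]

print(cosine_similarity(cat, dog))  # near 1: semantically close
print(cosine_similarity(cat, car))  # noticeably lower
```

Scores range from -1 to 1, with higher values indicating closer semantic value.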
Fancy formula from Wikipedia
Models were trained on French and Spanish Wikipedia articles; the raw vocabulary included non-words and words containing non-language characters, which were removed (7.3% and 6.6% of the vocabulary, respectively)
Remaining words were lemmatized, further reducing the vocabulary by roughly 10%
(Steps 3 & 4 done to hedge for translation errors and to avoid outliers)
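The non-word filtering step might look like the following sketch; the regex character class is an assumption about what counts as a “language character” for French/Spanish, not the study's actual filter:

```python
import re

# Hypothetical cleaning step: drop tokens containing characters outside
# the target language's alphabet. The character class below is an
# illustrative approximation for French/Spanish text.
VALID = re.compile(r"^[a-záéíóúüñàâæçèêëîïôœùûÿ'-]+$", re.IGNORECASE)

def clean(tokens):
    # Keep only tokens made entirely of valid language characters.
    return [t for t in tokens if VALID.match(t)]

raw = ["château", "niño", "xX_user42", "http://example.com", "coeur"]
print(clean(raw))  # tokens with digits, URLs, etc. are removed
```

Lemmatization would then collapse the surviving inflected forms onto their dictionary headwords, which is where the further ~10% vocabulary reduction comes from.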
Johansson and Nieto Piña 2015: build hyperspaces from systems like WordNet
We should be able to do this for many IE languages (mostly modern)
For languages without WordNets (like PIE), we “translate” the lexicon (< DERBi PIE) into a WordNet structure
Eliminate any spurious matchings (like *i̯eh₂- “drive” = “hit a golf ball”)
But the same problems as before remain: it is less precise, and it must be done manually
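The “translate the lexicon into a WordNet structure, then prune” idea can be sketched as follows. Everything here is illustrative: the entry, the candidate senses, and the blocklist are hypothetical stand-ins, not DERBi PIE data or real WordNet synset IDs:

```python
# Hypothetical mapping of a PIE root onto WordNet-style synsets via its
# gloss, with a manually curated blocklist for spurious sense matches.
lexicon = {"*i̯eh₂-": "drive"}

# Candidate senses a gloss lookup might return for "drive".
candidate_synsets = {
    "drive": [
        "drive.v.01 (operate a vehicle)",
        "drive.v.02 (hit a golf ball)",
        "drive.v.03 (urge forward, impel)",
    ],
}

# Manual exclusions, as described above: senses that are untrue of the root.
blocked = {("*i̯eh₂-", "drive.v.02 (hit a golf ball)")}

mapping = {
    root: [s for s in candidate_synsets[gloss] if (root, s) not in blocked]
    for root, gloss in lexicon.items()
}
print(mapping["*i̯eh₂-"])
```

The manual blocklist is exactly the bottleneck noted above: every gloss's sense list has to be vetted by hand.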
Probably the best course of action is to stick with the Descendant Model strategy, but:
Utilize LLMs (such as GPT) to model the hyperspace, for greater precision and better differentiation of polysemy
Instead of Google Translate, use bilingual dictionary or LLM
Add additional Romance languages;
when happy with results, move on to other subbranches (likely Slavic or Indic)
Iterate, iterate, iterate!